Abstract

Intellectual Property (IP) libraries are commonly used by hardware designers to increase productivity
and reduce time-to-market. These static IP libraries do not give designers flexibility in customizing
trade-offs. We propose a parameterized DSP IP generator that allows designers to specify the
cost/performance tradeoff. We present a prototype implementation of a parameterized DFT generator
and compare our generated DFT with Xilinx LogiCore's DFT IP core. Our results show that we generate
high-quality DFT blocks that match the performance and cost of Xilinx LogiCore DFT implementations.
More importantly, we show that our parameterized design generation yields customized DFT blocks over
a range of different performance/cost tradeoff points.

Introduction. We propose a parameterized IP generator as an alternative to static IP blocks. The
generator is tailored for application-specific tradeoffs, such as area, performance, numerical
accuracy, and power consumption. Our approach preserves the advantages of using static IP blocks
while allowing designers more control over the design. The generator can be used together with a
search engine to find the best possible implementation for a given set of constraints. Here we
present our experience in developing a parameterized generator for the discrete Fourier transform
(DFT). A full description of this work has been submitted to a conference.

Generation  of  Discrete  Fourier  Transform.  Our  DFT  generator  is  based  on  the  Pease  algorithm  for  the 
DFT,  which  we  express  in  a  formula  notation  as 


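\[ F_{2^n} = \left\{ \prod_{i=0}^{n-1} L^{2^n}_{2} \, (I_{2^{n-1}} \otimes F_2) \, T_{n-i} \right\} R_{2^n}, \qquad F_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \qquad (1) \]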

where 'I' denotes an identity matrix, '⊗' the Kronecker product of matrices, 'T' the twiddle factors,
'R' the bit reversal, and 'L' a stride permutation. Figure 1 shows a dataflow representation of (1)
for n = 3. This formula-derived dataflow graph can be mapped directly to a combinational circuit whose
implementation cost is approximately N log2(N)/2 C blocks for a transform of size N = 2^n, plus the
routing cost of realizing the stride-permutation (L) wiring. A combinational implementation is usually
too large to be realistic for large n. A practical DFT implementation is therefore commonly sequential:
the logic resources, e.g., the C blocks, are reused multiple times by horizontal folding or vertical
folding. Figure 1 shows, for n = 3, block diagrams of a horizontally folded DFT (middle) and a
horizontally and vertically folded DFT (right).
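The cost argument above can be made concrete with a small counting sketch. It assumes a radix-2 datapath with N/2 C blocks per stage and log2(N) stages, and it assumes that horizontal folding keeps exactly one such column and reuses it for every stage. The function names are hypothetical and the counts ignore permutation wiring and buffering.

```python
def num_stages(size: int) -> int:
    """Return log2(size) for a power-of-two transform size."""
    stages = size.bit_length() - 1
    if size < 2 or (1 << stages) != size:
        raise ValueError("transform size must be a power of two >= 2")
    return stages


def c_blocks_flattened(size: int) -> int:
    """C blocks in the fully flattened (combinational) datapath."""
    return (size // 2) * num_stages(size)


def c_blocks_h_folded(size: int) -> int:
    """C blocks when one column is reused for all stages (assumed model)."""
    num_stages(size)  # validates the size
    return size // 2


if __name__ == "__main__":
    for size in (8, 64, 1024):
        print(size, c_blocks_flattened(size), c_blocks_h_folded(size))
```

Under these assumptions the flattened cost grows as N log2(N)/2 while the horizontally folded column grows only as N/2, which is why folding becomes attractive for large transforms.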

Our DFT generator accepts as input parameters the DFT size, the data format (i.e., fixed-point number
range and precision), and a design parameter p that controls the degree of parallelism in the generated
implementation. This freedom allows the designer to select a custom tradeoff between minimizing cost
(i.e., area and power) and maximizing performance (i.e., latency and throughput). Our DFT generator can
also accept target-specific parameters that reflect the designer's preference for different classes of
resources. One such parameter is the relative value of a Block Select-RAM (BRAM, a specialized memory
primitive) in terms of slices (generic logic building blocks). The DFT generator takes this preference
into account to balance resource minimization across BRAM and logic slice utilization. Table 1 lists
the parameters that our DFT generator currently supports and the corresponding effects on BRAM
utilization, logic slice utilization, and throughput. The output of our generator is an RTL-level
Verilog description of the desired DFT implementation.

Figure 1: Pease DFT algorithm. From left to right: completely flattened, horizontally folded, fully
horizontally and vertically folded.

Figure 2: Synthesis results for F64: slice utilization, BRAM utilization, and transform throughput
(transforms/second, overlapping loading and unloading). Panels: Number of Slices for DFT64, Number of
BRAMs for DFT64, Throughput of DFT64 (x-axis: p; legend: Min Slice).

Table 1: Parameters for DFT IP Generator and the corresponding effects on logic slices, BRAM and
throughput as each parameter increases

Columns: Parameter, Logic Slices, BRAM, Throughput
Parameters: n (transform size); p (parallelism); fixed-point number range and precision; relative
value of BRAM cost

Sample Result. For n = 6 (i.e., F64), our DFT generator produces 6 implementations representing
different tradeoffs between the design goals and constraints. Figure 2 shows the resource utilization,
in terms of slices and BRAMs, and the throughput of these 6 design choices, compared against the
latest Xilinx LogiCore DFT implementations. The generated DFT implementations are synthesized for the
Xilinx Virtex2-Pro XC2VP100-6FF1696 FPGA using Xilinx ISE version 6.1.03i. To show the effects of our
Virtex2-Pro-specific parameters, each graph reports two separate results corresponding to the extreme
tradeoff points of slice and BRAM utilization: 1) minimize the use of slices, and 2) minimize the use
of BRAMs. As the graphs show, our minimum-resource design points (i.e., p = 1 or p = 2) occupy a
similar tradeoff space as the Xilinx LogiCore DFT implementations. By varying p and the relative value
of BRAM, the designer can customize the tradeoff between performance, slice usage, and BRAM usage. For
larger p values, our generated DFT implementations offer higher throughput at the cost of increased
resource requirements.
The Paradox of Reusable IPs


♦  Boon  to  productivity 

-  zero  effort  required 

-  zero  knowledge  required 

-  zero  chance  to  introduce  new  bugs 

Why repeat what has already been done?

♦  Bane  to  optimality 

-  finding  the  right  functionality  with  the  right  interface 

-  design tradeoffs: performance, area, power, accuracy

Are  you  getting  what  you  really  wanted? 

♦  Solution:  parameterized  automatic  IP  generators 

-  zero  effort,  knowledge  or  bugs 

-  allows  application  specific  customization 

-  facilitates  design  exploration 
Discrete Fourier Transform IPs


♦  Discrete  Fourier  Transform  (DFT) 

-  important  building  block  in  DSP  applications 

-  numerous  design  “cores”  available 

♦  Some  commonly  supported  options  in  IP  libraries 

-  transform  sizes 

-  number  format 

-  i/o  data  ordering 

-  a  small  number  of  microarchitecture  choices  (e.g.,  min  area, 
max  speed) 

♦  Customized  design  tradeoff  in  our  generated  IPs 

-  degree of parallelism in microarchitecture (min ↔ max)

-  resource  preference  (e.g.  BRAM  vs.  LUT in  FPGAs) 

Extensible  to  other  common  linear  DSP  transforms 
Outline


♦  Introduction 

♦  Formula-Driven  Design  Generation 

♦  Microarchitecture  Parameterization 

♦  Resource  Parameterization 

♦  Experimental  Results 

♦  Conclusions 
Transforms as Formulas [www.spiral.net]


♦  Transform

-  $\mathrm{DFT}_n$: parameterized matrix

♦  Recursion

-  $\mathrm{DFT}_{nm} \to (\mathrm{DFT}_n \otimes I_m) \cdot D \cdot (I_n \otimes \mathrm{DFT}_m) \cdot P$

-  a breakdown strategy

-  product of sparse matrices

♦  Algorithm as Tree

-  $\mathrm{DFT}_8$ splits into $\mathrm{DFT}_2$ and $\mathrm{DFT}_4$; $\mathrm{DFT}_4$ splits into $\mathrm{DFT}_2$ and $\mathrm{DFT}_2$

-  recursive application of rules

-  uniquely defines an algorithm

-  efficient representation

-  easy manipulation

♦  Algorithm as Formula

-  $\mathrm{DFT}_8 = (F_2 \otimes I_4) \cdot D \cdot (I_2 \otimes ((F_2 \otimes I_2) \cdots)) \cdot P$

-  few constructs and primitives

-  uniquely defines an algorithm

-  can be translated into code
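The recursion rule on this slide can be checked numerically. The sketch below is not taken from the slides: it assumes the standard Cooley-Tukey reading, in which D is the diagonal twiddle matrix and P is the stride permutation that reads the input at stride n, and it assumes the sign convention w = exp(-2*pi*i/N). All function names are hypothetical.

```python
import numpy as np


def dft(n):
    """Dense DFT matrix with entries w^(j*k), w = exp(-2*pi*i/n) (assumed sign)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)


def stride_perm(size, stride):
    """Permutation matrix P with (P x) = x[0], x[stride], x[2*stride], ..., x[1], ..."""
    order = np.arange(size).reshape(size // stride, stride).T.flatten()
    return np.eye(size)[order]


def twiddle(n, m):
    """Diagonal D for the n*m split: entry w_{nm}^(a*c) at position a*m + c."""
    a, c = np.arange(n), np.arange(m)
    return np.diag(np.exp(-2j * np.pi * np.outer(a, c).flatten() / (n * m)))


def cooley_tukey(n, m):
    """Right-hand side of DFT_nm -> (DFT_n (x) I_m) D (I_n (x) DFT_m) P."""
    return (np.kron(dft(n), np.eye(m)) @ twiddle(n, m)
            @ np.kron(np.eye(n), dft(m)) @ stride_perm(n * m, n))


if __name__ == "__main__":
    print(np.allclose(cooley_tukey(2, 4), dft(8)))
```

Each of the four factors is sparse, which is the property the slide refers to as a "product of sparse matrices"; applying the rule again to the smaller DFTs yields the tree and formula views.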
Formula to Datapath


♦  Given $M \cdot x$ where M is

-  $M = A \cdot B$: apply B, then A

-  $M = I_n \otimes A$: apply A, n times in parallel

-  $M = A \otimes I_n$: apply A, n times in parallel, taking inputs at stride n

-  M is a permutation: permute x

-  M is a diagonal: scale x

-  etc.


♦  formulas  are  a  natural  HW  description 

♦  formulas  allow  manipulation 

♦  formulas  can  be  translated  into  Verilog 
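The two Kronecker-product cases can be illustrated in software without ever building the large matrix, which mirrors how the formula maps to parallel copies of one block. This is a minimal sketch with hypothetical helper names, checked against `numpy.kron`.

```python
import numpy as np


def apply_parallel(a, n, x):
    """Compute (I_n (x) A) x: apply A to n consecutive blocks of x."""
    blocks = x.reshape(n, a.shape[1])      # row b holds the b-th input block
    return (blocks @ a.T).reshape(-1)      # row b becomes A @ block_b


def apply_strided(a, n, x):
    """Compute (A (x) I_n) x: apply A to n subsequences taken at stride n."""
    cols = x.reshape(a.shape[1], n)        # column j holds x[j], x[j+n], ...
    return (a @ cols).reshape(-1)          # A acts on every column at once


F2 = np.array([[1.0, 1.0], [1.0, -1.0]])
x = np.arange(8.0)

if __name__ == "__main__":
    print(np.allclose(apply_parallel(F2, 4, x), np.kron(np.eye(4), F2) @ x))
    print(np.allclose(apply_strided(F2, 4, x), np.kron(F2, np.eye(4)) @ x))
```

In both cases the same small block A is instantiated n times; only the way inputs are gathered differs, which is what makes the formula a natural hardware description.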
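Pease DFT


♦  Simple regular structure embodied in formula

\[ \mathrm{DFT}_{2^k} = \left( \prod_{i=0}^{k-1} L^{2^k}_{2} \, (I_{2^{k-1}} \otimes F_2) \, T_{k-i} \right) R_{2^k} \]

where

\[ T_{k-i} = L^{2^k}_{2^{k-i-1}} \left( I_{2^i} \otimes D^{2^{k-i}}_{2^{k-i-1}} \right) L^{2^k}_{2^{i+1}} \]
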

♦  Example 

\[ \mathrm{DFT}_8 = \left( L^8_2 (I_4 \otimes F_2) T_3 \right) \left( L^8_2 (I_4 \otimes F_2) T_2 \right) \left( L^8_2 (I_4 \otimes F_2) T_1 \right) R_8 \]
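The Pease factorization can be checked against a dense DFT matrix. The sketch below makes several assumptions that are not stated on the slide: the twiddle matrices are taken as T_{k-i} = L^{2^k}_{2^{k-i-1}} (I_{2^i} ⊗ D^{2^{k-i}}_{2^{k-i-1}}) L^{2^k}_{2^{i+1}}, D^{N}_{s} is the usual twiddle diagonal, L^{N}_{s} reads its input at stride s, and w = exp(-2*pi*i/N). The helper names are hypothetical.

```python
import numpy as np

F2 = np.array([[1.0, 1.0], [1.0, -1.0]])


def dft(size):
    """Dense DFT matrix, w = exp(-2*pi*i/size) (assumed sign convention)."""
    idx = np.arange(size)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / size)


def stride_perm(size, stride):
    """L^{size}_{stride}: output is x[0], x[stride], ..., x[1], x[1+stride], ..."""
    order = np.arange(size).reshape(size // stride, stride).T.flatten()
    return np.eye(size)[order]


def bit_reversal(k):
    """R_{2^k}: output j takes the input whose k-bit index is j reversed."""
    order = [int(format(j, f"0{k}b")[::-1], 2) for j in range(1 << k)]
    return np.eye(1 << k)[order]


def twiddle_diag(size, stride):
    """D^{size}_{stride}: entry w_size^(a*c) at position a*stride + c."""
    a, c = np.arange(size // stride), np.arange(stride)
    return np.diag(np.exp(-2j * np.pi * np.outer(a, c).flatten() / size))


def pease_twiddle(k, i):
    """T_{k-i} under the assumed definition given in the lead-in."""
    size = 1 << k
    d = twiddle_diag(1 << (k - i), 1 << (k - i - 1))
    return (stride_perm(size, 1 << (k - i - 1))
            @ np.kron(np.eye(1 << i), d)
            @ stride_perm(size, 1 << (i + 1)))


def pease_dft(k):
    """Multiply out the k identical-structure stages, then the bit reversal."""
    size = 1 << k
    column = stride_perm(size, 2) @ np.kron(np.eye(size // 2), F2)
    result = np.eye(size, dtype=complex)
    for i in range(k):
        result = result @ column @ pease_twiddle(k, i)
    return result @ bit_reversal(k)


if __name__ == "__main__":
    print(np.allclose(pease_dft(3), dft(8)))
```

Note that `column` is the same matrix in every stage; only the diagonal twiddle changes. That regularity is what allows a single hardware column to be reused across stages.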
Pease DFT Example: DFT8


[Figure: dataflow graph, input to output, through $R_8$, then $T_1 (I_4 \otimes F_2) L^8_2$, then $T_2 (I_4 \otimes F_2) L^8_2$, then $T_3 (I_4 \otimes F_2) L^8_2$; one C block is marked]

Repeating column structure => hardware reuse with zero perf. penalty
Horizontally Folded Pease DFT


controllable degree of parallelism (p)
V-folding according to p


Latency = t((n / ...)(log(n) - 1))
V-folding of Stride Permutations


♦  Stride Permutation

\[ L^{2^k}_{2} = \left( I_{2^{k-q}} \otimes L^{2^q}_{2} \right) \left( \prod_{i=0}^{k-q-1} \left( I_{2^{k-q-i-1}} \otimes J_{2^{q+i+1}} \right) \right) \]

♦  In other words

[Figure: block diagram of the factorization above; labels: $n / 2^{k-q}$ and $L^{2^q}_{2} \left( \prod_{i=0}^{k-q-1} (I_{2^{k-q-i-1}} \otimes J_{2^{q+i+1}}) \right)$]


[Takala, et al. ICASSP'2001]
V-folding of Stride Permutations


[Figure: datapath with 2p inputs implementing $L^{2^q}_{2} \left( \prod_{i=0}^{k-q-1} (I_{2^{k-q-i-1}} \otimes J_{2^{q+i+1}}) \right)$]


[Takala, et al. ICASSP'2001]
FIFO: BRAM vs. CLBs


♦  J-matrix FIFOs are a significant part of logic resources

♦  FIFOs can be constructed from

-  shift registers using CLB slices, or

-  circular buffers using CLB slices (distributed RAM), or

-  circular buffers using BRAM memory macros

♦  "Exchange rate" of shift registers vs. circular buffer

[Chart: D-RAM vs. Shift Reg over FIFO size (vertical axis 0 to 1400); annotation: performance difference is negligible]

Let user set the context-dependent break-even point
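One way to picture a user-set break-even point is as a single weight that converts BRAMs into an equivalent number of slices. The sketch below is purely illustrative: the `FifoOption` type, the cost function, and every number in it are hypothetical and are not taken from the generator or from the measurements on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FifoOption:
    name: str    # e.g. "shift register", "distributed RAM", "BRAM"
    slices: int  # estimated CLB slices used (hypothetical figure)
    brams: int   # estimated BRAM macros used (hypothetical figure)


def weighted_cost(option: FifoOption, bram_value_in_slices: float) -> float:
    """Total cost in slice-equivalents under the user-chosen exchange rate."""
    return option.slices + option.brams * bram_value_in_slices


def pick_fifo(options, bram_value_in_slices: float) -> FifoOption:
    """Pick the cheapest FIFO implementation for the given exchange rate."""
    return min(options, key=lambda o: weighted_cost(o, bram_value_in_slices))


# Made-up estimates for one FIFO, used only to exercise the selection logic.
candidates = [
    FifoOption("shift register", slices=300, brams=0),
    FifoOption("BRAM", slices=20, brams=1),
]

if __name__ == "__main__":
    print(pick_fifo(candidates, bram_value_in_slices=100).name)
    print(pick_fifo(candidates, bram_value_in_slices=500).name)
```

With a low BRAM value the BRAM-based buffer wins; raising the value past the break-even point flips the choice to slices, which is the context-dependent behavior the slide asks the user to control.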

Nordin, Hoe, Püschel, CMU/ECE, HPEC 2004, Slide 15




Xilinx  LogiCore  Library 


♦  DFT  based  on  Radix-4  Cooley-Tukey 

-  range  of  sizes 

-  streaming  vs.  burst  I/O 

-  fixed-point  scaling  modes 

-  in/out  data  ordering 

♦  Evaluation 

-  DFT sizes 64, 1024 and 2048

-  burst I/O interface, bit-reversed ordering

-  Xilinx Virtex2-Pro XC2VP100-6

-  Xilinx ISE version 6.1.03i
DFT64


[Charts: Number of Slices for DFT64, Number of BRAMs for DFT64, Throughput of DFT64; legend: BRAM, Slice, Xilinx]
DFT1024 and DFT2048


[Charts: Number of Slices for DFT1024, Number of Slices for DFT2048, Number of BRAMs for DFT1024, Number of BRAMs for DFT2048, Throughput of DFT1024; x-axis: p = 1, 2, 4, 8, 16, 32; legend: Min Slice, Xilinx]
Related Work


♦  Kumhom, Johnson, Nagvajara, ASIC/SOC 2000

-  universal FFT processor microarchitecture based on processing elements interconnected by
on-chip reconfigurable network

-  microarchitecture is scalable in the number of elements

-  supports both Cooley-Tukey and Pease

♦  Choi, Scrofano, Prasanna, Jang, FPGA'2003

-  mapped radix-4 Cooley-Tukey algorithm onto log(n)/2 DFT4 primitives

-  scalable datapath between 1 element and 4 elements at a time

-  show energy and performance improvements from scaling

-  does not show that the same tradeoff point as Xilinx can be covered
Conclusions


♦  Parameterized  IP  Generator 

-  easy  to  use 

-  allows  customization 

♦  Prototype  implementation  of  DFT  generator 

-  parameterized  performance/cost  tradeoff 

-  parameterized  resource  usage  preference 

♦  Key  results 

-  generator  is  efficient,  i.e.,  the  Xilinx  design  point  can  be 
matched 

-  customization  allows  advantage  in  a  chosen  dimension 
relative  to  Xilinx  DFT  cores 